convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization#20539
Conversation
> Please check.

> Yeah, something is off. I didn't properly smoke-test due to lack of memory.
@vbooka1 @richarddd Fixed by #20506 |
Force-pushed from 3530623 to 585e8da
convert : support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization (ggml-org#20539)

* support mixed-precision ModelOpt models with per-tensor NVFP4/FP8 quantization
* cleanup
* fallback

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Adds support for converting mixed-precision ModelOpt models (e.g. nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4) whose quantization config specifies a per-tensor quant_algo mixing NVFP4 and FP8 layers, rather than a single global quant_algo: "NVFP4". NVFP4 tensors (which carry 2D block scales) are repacked natively, while FP8 tensors (which carry 1D scales) are dequantized to float.
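The per-tensor dispatch described above can be sketched roughly as follows. This is a hypothetical illustration, not llama.cpp's actual converter code: the names `dispatch`, `quant_map`, and `dequant_fp8` are made up for this sketch, and the FP8 weights are assumed to already be upcast to a float array with a 1D per-row scale.

```python
import numpy as np

def dispatch(name: str, quant_map: dict) -> str:
    """Pick a conversion path per tensor instead of one global quant_algo.

    `quant_map` maps tensor names to their quant_algo ("NVFP4", "FP8", ...),
    standing in for a per-tensor quantization config.
    """
    algo = quant_map.get(name, "NONE")
    if algo == "NVFP4":
        return "repack"       # 2D block scales: keep quantized, repack natively
    if algo == "FP8":
        return "dequantize"   # 1D scales: expand back to float on conversion
    return "passthrough"      # unquantized tensors are copied as-is

def dequant_fp8(weight: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Dequantize an FP8 tensor (already upcast to float) with its 1D scale.

    Each row of the 2D weight is multiplied by its per-row scale factor.
    """
    return weight * scale[:, None]
```

A usage sketch: `dispatch("blk.0.ffn_up.weight", quant_map)` would return `"repack"` for an NVFP4 layer and `"dequantize"` for an FP8 one, mirroring the fallback behavior the PR adds.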
Fixes: #20504